Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for UFloat in PintArray (#139) #140

Open
wants to merge 58 commits into
base: master
Choose a base branch
from

Conversation

MichaelTiemannOSC
Copy link
Collaborator

@MichaelTiemannOSC MichaelTiemannOSC commented Oct 15, 2022

Signed-off-by: MichaelTiemann 72577720+MichaelTiemannOSC@users.noreply.github.com

Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@MichaelTiemannOSC
Copy link
Collaborator Author

As noted, this change request may precipitate changes needed in how Pint deals with uncertainties (whose representation is not supported by vanilla Python parsers).

@andrewgsavage
Copy link
Collaborator

This lets you store ufloat objects, but I'm not sure how well pandas will for operations like reshaping, indexing etc. I suspect some will work but some won't.

I think you can add ufloats to fixtures in test_pandas_extensiontests and it'll run with standard and ufloat quantities. This should show what works and doesn't.

I think you should make it so that it only stores floats or ufloats in a PintArray. There's a few points where arrays are expected to contain the same object in them, and I think pandas expects each extensiontype to only store one type of object.

I also think (for the time being) you should make it only allow floats or ufloats globally, as there's no way to distinguish between a pint[m] (float) or pint[m] (ufloat) type. This is a wider issue that also applies to ints or other numeric types.

@MichaelTiemannOSC
Copy link
Collaborator Author

I'm relatively new to all the pytest stuff in pint/pandas, etc. Am I understanding correctly that I should modify

@pytest.fixture
def data():
    return PintArray.from_1darray_quantity(np.arange(start=1.0, stop=101.0) * ureg.nm)


@pytest.fixture
def data_missing():
    return PintArray.from_1darray_quantity([np.nan, 1] * ureg.meter)


@pytest.fixture
def data_for_twos():
    x = [
        2.0,
    ] * 100
    return PintArray.from_1darray_quantity(x * ureg.meter)

to ufloat-based values? I'm going to give that a try, but if there's a better way to do it (or a you meant something different) please guide me. Thanks!

@MichaelTiemannOSC
Copy link
Collaborator Author

MichaelTiemannOSC commented Oct 16, 2022

Progress...I've now got 203 failures, the vast majority of which are due to the fact that ufloats that are computed to be the same value, but not the same variable, do not compare as equal, so all the assert_series_equal, assert_extension_array_equal, assert_frame_equal fail, through no fault of their own. It will be quite some work to plug in an appropriate "assert_*_nominally_equal` everywhere.

I do not know how to make progress on this:

__________________________________________________________________________________________________________________ ERROR at setup of TestMethods.test_insert_invalid __________________________________________________________________________________________________________________
file /Users/michael/opt/miniconda3/envs/pandas/lib/python3.9/site-packages/pandas/tests/extension/base/methods.py, line 558
      def test_insert_invalid(self, data, invalid_scalar):
E       fixture 'invalid_scalar' not found
>       available fixtures: all_arithmetic_operators, all_boolean_reductions, all_compare_operators, all_data, all_numeric_reductions, as_array, as_frame, as_series, box_in_series, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, data, data_for_grouping, data_for_sorting\
, data_for_twos, data_missing, data_missing_for_sorting, data_repeated, doctest_namespace, dtype, fillna_method, groupby_apply_op, monkeypatch, na_cmp, na_value, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, sort_by_key, tmp_path, tmp_p\
ath_factory, tmpdir, tmpdir_factory, use_numpy
>       use 'pytest --fixtures [testpath]' for help on them.

/Users/michael/opt/miniconda3/envs/pandas/lib/python3.9/site-packages/pandas/tests/extension/base/methods.py:558
_________________________________________________________________________________________________________________ ERROR at setup of TestSetitem.test_setitem_invalid __________________________________________________________________________________________________________________
file /Users/michael/opt/miniconda3/envs/pandas/lib/python3.9/site-packages/pandas/tests/extension/base/setitem.py, line 437
      def test_setitem_invalid(self, data, invalid_scalar):
E       fixture 'invalid_scalar' not found
>       available fixtures: all_arithmetic_operators, all_boolean_reductions, all_compare_operators, all_data, all_numeric_reductions, as_array, as_frame, as_series, box_in_series, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, data, data_for_grouping, data_for_sorting\
, data_for_twos, data_missing, data_missing_for_sorting, data_repeated, doctest_namespace, dtype, fillna_method, full_indexer, groupby_apply_op, monkeypatch, na_cmp, na_value, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, sort_by_key, t\
mp_path, tmp_path_factory, tmpdir, tmpdir_factory, use_numpy
>       use 'pytest --fixtures [testpath]' for help on them.

/Users/michael/opt/miniconda3/envs/pandas/lib/python3.9/site-packages/pandas/tests/extension/base/setitem.py:437

Also: 191 passed, 1 skipped, 69 xfailed, 5 xpassed

@andrewgsavage
Copy link
Collaborator

andrewgsavage commented Oct 16, 2022

I'm relatively new to all the pytest stuff in pint/pandas, etc. Am I understanding correctly that I should modify

yes like that, there are some fixtures that return a value from a list, like all_arithmetic_operators. I'd try changing these so they test both ufloat and float.

@andrewgsavage
Copy link
Collaborator

andrewgsavage commented Oct 16, 2022

Use this fork of pandas for the assert...equal tets to work

pandas: ["git+https://github.com/andrewgsavage/pandas.git@assert_almost_equal"]

edit: I'm not 100% sure this will work, it makes the tests work for quantity but you might have a different issue there

@andrewgsavage
Copy link
Collaborator

fixture 'invalid_scalar' not found

that's coming from a newer pandas with extra tests compared pandas 1.4x, you can rebase from my #133 if you're using a higher version... but for now I wouldn't worry about those errors

@MichaelTiemannOSC
Copy link
Collaborator Author

MichaelTiemannOSC commented Oct 17, 2022

More progress...down to essentially one kind of failure--the un-hashabiliity of UFloat. I don't have the Python sophistication to know what the next step would be, but I'll push the changes I have thus far.

=== short test summary info ================================================================================
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_agg[True] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_agg[False] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_agg_extension - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="B") are different
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_no_sort - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_transform - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_apply[scalar] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_apply[list] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_apply[series] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_apply[object] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_in_numeric_groupby - AssertionError: Did not see expected warning of class 'FutureWarning'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestInterface::test_contains - AssertionError
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_value_counts_with_normalize - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_sort_values_frame[True] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_sort_values_frame[False] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_factorize[-1] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_factorize[-2] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_factorize_equivalence[-1] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_factorize_equivalence[-2] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_hash_pandas_object_works[True] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_hash_pandas_object_works[False] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_value_counts[data-True] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_value_counts[data-False] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_value_counts[data_missing-True] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_value_counts[data_missing-False] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_unique[<lambda>-Series] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_unique[<lambda>-<lambda>] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_unique[unique-Series] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_unique[unique-<lambda>] - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestArithmeticOps::test_divmod_series_array - TypeError: ufunc 'divmod' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''...
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestArithmeticOps::test_divmod - TypeError: ufunc 'divmod' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestReshaping::test_merge_on_extension_array - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestReshaping::test_merge_on_extension_array_duplicates - TypeError: unhashable type: 'AffineScalarFunc'
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestSetitem::test_setitem_scalar_key_sequence_raise - Failed: DID NOT RAISE <class 'ValueError'>
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestSetitem::test_setitem_frame_2d_values - AssertionError: Caused unexpected warning(s): [('UnitStrippedWarning', UnitStrippedWarning('The unit of the quantity is stripped when downcasting to ndarray.'), ...
FAILED pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestSetitem::test_delitem_series - TypeError: unhashable type: 'AffineScalarFunc'
ERROR pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestMethods::test_insert_invalid
ERROR pint-pandas/pint_pandas/testsuite/test_pandas_extensiontests.py::TestSetitem::test_setitem_invalid
==== 35 failed, 331 passed, 48 xfailed, 26 xpassed, 268 warnings, 2 errors in 15.68s =======================

@MichaelTiemannOSC
Copy link
Collaborator Author

Does it make sense to rebase PintArrays on uarrays (which can be tuned to use numpy.ndarray when magnitudes are np.floating and use unumpy.uarray when they are UFloat? This notebook shows how easily things can be changed on the fly: https://github.com/Quansight-Labs/unumpy/blob/master/notebooks/01_user_facing.ipynb

@MichaelTiemannOSC
Copy link
Collaborator Author

I think you should make it so that it only stores floats or ufloats in a PintArray. There's a few points where arrays are expected to contain the same object in them, and I think pandas expects each extensiontype to only store one type of object.

I'll give that a try. While I'm not sure it's toally necessary, I think it's a good exercise.

I also think (for the time being) you should make it only allow floats or ufloats globally, as there's no way to distinguish between a pint[m] (float) or pint[m] (ufloat) type. This is a wider issue that also applies to ints or other numeric types.

It will certainly keep the discussion with Pandas on point to present uniform ExtensionTypes and to keep both Pint and Pandas on the straight-and-narrow there.

@andrewgsavage
Copy link
Collaborator

Does it make sense to rebase PintArrays on uarrays

you can try, it's easy to check. I don't know what effects that'd have. Numpy is used accross pandas so would still be used.

Big remaining problem is unhanshable UFloat type.

Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Signed-off-by: 72577720+MichaelTiemannOSC@users.noreply.github.com
@@ -24,6 +33,120 @@

ureg = PintType.ureg

import pandas._testing as tm
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note about these changes: when comparing UFloats, if the .s (std_dev) terms zero, then two numbers with the same .n (nominal_value) terms will compare equal. However, if there are uncertainties in the numbers, even identical uncertainties, they will not compare equal unless they are identically the same uncertain variable. Since the test suite typically finds two ways to do the same thing and then compare the results, this leads to UFloats comparing not equal unless the uncertainties are zero.

I started making chnages to try to compare UFloats by comparing their .n and .s separately (not using ==), but this is a real hassle, as there are many, many cases to test and unpack. For my own sanity, I simplified the task by using only UFloats with uncertainty values of zero, allowing me to test float vs. UFloat handling without the spurious problems with ufloat(a) =?= ufloat(b).

I've left the code I started writing for the more general relaxation of ufloat comparison, but it shouldn't actually be needed if we keep our unfloat uncertainty at zero (which is presently the case). So we can either remove, improve, or defer dealing with uassert_equal and friends. LMK

Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Big remaining problem is unhanshable UFloat type.

Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Signed-off-by: 72577720+MichaelTiemannOSC@users.noreply.github.com
Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
…C/pint-pandas into ducks-unlimited

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@MichaelTiemannOSC
Copy link
Collaborator Author

I thought it would be helpful to me to merge in changes that have been merged upstream, not realizing how that would affect the size of the PR. If I should revert to the previous version so that the PR has only my changes, I'll do that. I'm still learning my way around git, and everything new I do is truly a "learning experience".

Deal with duality of np.nan and _ufloat_nans when constructing PintArrays, creating float64 magnitude arrays when possible and ufloat magnitudes when necessary.  Note that PintArray columns don't advertise whether magnitudes are np.float64 or np.obejct; they are what they are.  Also commented out warning, which simply creates noise.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Fix conditional expression.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Don't try to build a large-enough-to-hold-uncertainties array if we are not using uncertainties in any way.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Allow pip install command to unpack uncertainties option into multiple arguments.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@MichaelTiemannOSC
Copy link
Collaborator Author

Thanks @hgrecco for creating Pint-0.23rc0. I've attempted to write some rules to optionally test uncertainties, but I don't know the right way to leave the normal Pint-Pandas matrix to do its thing, and then create an uncertainties test that uses Pandas-2.1.0 and Pint-0.23rc0. The approach I took simply overrode Pandas and Pint every time. Maybe I need to add a null string "" and then test for the null string? My yaml/shell scripting is fairly primitive.

In any case, if I can get the CI/CD working as it should, then these changes should be ready to go for a pint_pandas-0.6rc0.

By using `include` directive, we can add a special case to the matrix of possibilities.  In this case we add to the matrix the case where pandas>=2.1.0 and pint>=0.23rc0, which are both necessary for the UFloat array changes to work.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Don't test uncertainties in ci-pint-pre or ci-pint-master, as these don't have the necessary versions of Pandas or Pint to make the matrix include operation work.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Also test Pint-Pandas without uncertainties.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Trying again with different syntax to avoid duplicate key issue.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Trying yet another syntax to make `uncertainties` one of two options when pandas==2.1.0 and pint==0.23rc0.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try using null instead of "" for uncertainties that we don't want to run.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try using "null" to represent a condition where we don't want uncertainties installed.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try using "no-thank-you" instead of "null" for matrix permutations not intended to use uncertainties.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Wrap `uncertainties` matrix parameter in `[]`.  If we get this working, can try setting parameter to `''`' from several commits ago.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Use single quotes when testing the value of `uncertainties`.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Tweak conditional matrix logic by using a "same-but-different" value for `pint` value assignment.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Untabify yaml file...

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Make uncertainties a default null value in the matrix.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Attempt to explicitly add python-version and numpy to `uncertainties` case.  Otherwise, pytest never gets installed.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
We cannot reference the matrix when building the matrix, so unroll all parameter values.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Remove unused `uncertainties` from -pre and -master CI files.

Only test one configuration of Python against uncertainties (trying to work around duplicate key problem).

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@MichaelTiemannOSC
Copy link
Collaborator Author

We now have CI/CD testing without uncertainties across a matrix, and additionally testing uncertainties in one specific matrix case (Python 3.9, Pandas 2.1.0, Pint 0.23rc0). That should make us good to go for an 0.6rc0, no?

@MichaelTiemannOSC
Copy link
Collaborator Author

Nudge @andrewgsavage

@andrewgsavage
Copy link
Collaborator

Sorry whenever I look at this I see that uses a numpy object type array, which I didn't expect to work... and get hung up thinking about well it works.

There aren't any examples using an array of objects, so although it works, it's not really supported. Merging this implies support, which I'm not really sure is a good idea. The code wasnt written with that in mind, and other than your issue test there's no testing for this a mix of different objects in numpy array.

Does it make sense to rebase PintArrays on uarrays (which can be tuned to use numpy.ndarray when magnitudes are np.floating and use unumpy.uarray when they are UFloat? This notebook shows how easily things can be changed on the fly: https://github.com/Quansight-Labs/unumpy/blob/master/notebooks/01_user_facing.ipynb

reading this through again, that's kind of what I meant by making a seperate ExtenisonArray for uncertainties, say UncertaintyArray. The PintArray can then use the UncertaintyArray for the data, as it does with other pandas EAs. The extra logic needed to handle uncertainties you've written in this PR can be moved to the UA and the PintArray can deal with the units and delegate the rest to the UA magnitude (like how pint's Quantity does so)

I'm not sure how well a UA would work with pint-pandas (it may still need some code to handle uncertainties) but would like to see how well that works before deciding on this.

@MichaelTiemannOSC
Copy link
Collaborator Author

MichaelTiemannOSC commented Nov 12, 2023

Sorry whenever I look at this I see that uses a numpy object type array, which I didn't expect to work... and get hung up thinking about well it works.

Me too 😆

There aren't any examples using an array of objects, so although it works, it's not really supported. Merging this implies support, which I'm not really sure is a good idea. The code wasnt written with that in mind, and other than your issue test there's no testing for this a mix of different objects in numpy array.

Having spent a fair bit of time inside the Pandas code, especially transitioning from 1.5.x to 2.x, they really have nicely generalized support for object types in NumPy arrays. I started off needing a few changes to Pandas to support this implementation initially, but over time every single change I originally needed was rendered unnecessary. I did manage to get some changes merged upstream (and have another one in the works, see pandas-dev/pandas#55825). That latter one is not because numpy doesn't tolerate object types, but because Pint Quantities have their own opinions about being put into arrays.

Does it make sense to rebase PintArrays on uarrays (which can be tuned to use numpy.ndarray when magnitudes are np.floating and use unumpy.uarray when they are UFloat? This notebook shows how easily things can be changed on the fly: https://github.com/Quansight-Labs/unumpy/blob/master/notebooks/01_user_facing.ipynb

reading this through again, that's kind of what I meant by making a seperate ExtenisonArray for uncertainties, say UncertaintyArray. The PintArray can then use the UncertaintyArray for the data, as it does with other pandas EAs. The extra logic needed to handle uncertainties you've written in this PR can be moved to the UA and the PintArray can deal with the units and delegate the rest to the UA magnitude (like how pint's Quantity does so)

My initial thoughts are informed by the uncertainties implementation (which of course does use uarrays). Because of the way uncertainties wrap and unwrap everything, it's super-inconvenient to treat them as a first-class type that can live in an ExtensionArray. Maybe I didn't approach that correctly the first time, but I did try it and I couldn't make it work. Which is why I went with the flow and decided to let it do its magic on the magnitudes.

One of the very last things I needed to do was to ensure that the block managers would not mess up allocation of arrays that initially contained only floats, but which would later need to contain an uncertainty. I realized this was the same problem if one started with an array of float64 and then something later changed one entry to a complex128. I went big and added comprehensive complex testing to Pandas (pandas-dev/pandas#54761) which did indeed expose the same problem. So I proposed using the uncertainties changes to solve the complex128 problems. The Pandas folks fixed that by always allocating a fresh array if something didn't fit, and those changes were no longer neccessary.

I'm not sure how well a UA would work with pint-pandas (it may still need some code to handle uncertainties) but would like to see how well that works before deciding on this.

Happy to continue the discussion, and happy to prototype. Just wanted to share my experiences so far.

@andrewgsavage
Copy link
Collaborator

there is this
https://github.com/varchasgopalaswamy/AutoUncertainties/blob/main/auto_uncertainties/pandas_compat.py

does it do what you need on the uncertanty progopation side?

@MichaelTiemannOSC
Copy link
Collaborator Author

I reviewed that approach in this comment: hgrecco/pint#1797 (comment)

Which is why I re-upped my suggestion to merge the changes I developed instead. If I missed something, I would be happy to look again.

@andrewgsavage
Copy link
Collaborator

I made a start at this, enough to show what I'm thinking with code:

arr = np.array([ufloat(1, 0.01), ufloat(2, 0.1)])
arr
array([1.0+/-0.01, 2.0+/-0.1], dtype=object)

ua = UncertaintyArray(arr,np.array([False,False]))
ua
<UncertaintyArray>
[1.000+/-0.010, 2.00+/-0.10]
Length: 2, dtype: <class 'pint_pandas.uncertainty_array.UncertaintyType'>

pa = PintArray(ua,dtype="m")
pa
<PintArray>
[1.000+/-0.010, 2.00+/-0.10]
Length: 2, dtype: pint[meter]

pa.data
<UncertaintyArray>
[1.000+/-0.010, 2.00+/-0.10]
Length: 2, dtype: <class 'pint_pandas.uncertainty_array.UncertaintyType'>

Main benefits to this approach are:

  • an EA for uncertainties that can be used without units/pint, so can be used by more people
  • Uncertainty specific code lives in the UncertaintyArray (I hope!)

My initial thoughts are informed by the uncertainties implementation (which of course does use uarrays). Because of the way uncertainties wrap and unwrap everything, it's super-inconvenient to treat them as a first-class type that can live in an ExtensionArray. Maybe I didn't approach that correctly the first time, but I did try it and I couldn't make it work. Which is why I went with the flow and decided to let it do its magic on the magnitudes.

Do you have an example of this? I'm not familiar with uncertainties.


I started off by inherting from BaseMaskedArray, which uses a mask to know where the nans are. This may help with the nans issue, or it may not be needed. It's also not public... If avoiding the nans issue isn't needed it may be better to reserrect pandas-dev/pandas#45544

@MichaelTiemannOSC
Copy link
Collaborator Author

Who will be maintaining UncertaintyArray? Is this something we need to get the Pandas folks to adopt? Numpy? Pint depends on the uncertainties package (for Measurement). Getting Pandas to support uncertainties as a dtype would be quite a lift (see pandas-dev/pandas#49181 (review)). It could be that some day in the future, UncertaintyArray could be merged from Pint/Pint-Pandas into Pandas, but I think we have to assume that the work needs to start within our sphere, not theirs.

As for the wrapping/unwrapping, a better way to say this is that uncertainties does is magic not by how it specializes arrays but by wrapping all the NumPy ufuncs that operate on uncertainty values (though as NumPy and uncertainties have diverged there are unintentional gaps (see lmfit/uncertainties#163 (comment)).

unumpy arrays are not derived from numpy arrays, they wrap/reimplement everything. Which is why we have the ugliness of unp.isna(uarray) vs. pd.isna(PandasArray). It seems like UncertaintyArray could implement the necessities of uarrays while providing a pandas-friendly interface so we can use isna naturally. Which would be great.

Let me know if you'd like to share a branch and collaborate. My code certainly exercises uncertainties, Pint, and Pint-Pandas!

@andrewgsavage
Copy link
Collaborator

andrewgsavage commented Nov 14, 2023

Who will be maintaining UncertaintyArray?

It should either be in uncertainties or in another module, uncertainties-pandas.

Is this something we need to get the Pandas folks to adopt? Numpy?

Shouldn't, but may need changes like you found for pandas.

Pint depends on the uncertainties package (for Measurement).

Can't remember exactly what the plans were, but I remember reading something, possibly to move that to another module... but noones got around to it in forever.

Getting Pandas to support uncertainties as a dtype would be quite a lift (see pandas-dev/pandas#49181 (review)). It could be that some day in the future, UncertaintyArray could be merged from Pint/Pint-Pandas into Pandas, but I think we have to assume that the work needs to start within our sphere, not theirs.

Yea I don't see that happening in a long time either. There's a few pandas related issues under uncertainties so I think that's where to make a PR. It may be better in another package to prevent a dependency on pandas.

As for the wrapping/unwrapping, a better way to say this is that uncertainties does is magic not by how it specializes arrays but by wrapping all the NumPy ufuncs that operate on uncertainty values (though as NumPy and uncertainties have diverged there are unintentional gaps (see lebigot/uncertainties#163 (comment)).

unumpy arrays are not derived from numpy arrays, they wrap/reimplement everything. Which is why we have the ugliness of unp.isna(uarray) vs. pd.isna(PandasArray). It seems like UncertaintyArray could implement the necessities of uarrays while providing a pandas-friendly interface so we can use isna naturally. Which would be great.

Let me know if you'd like to share a branch and collaborate. My code certainly exercises uncertainties, Pint, and Pint-Pandas!

Numpy functions are a long way off! but I suspect you can do similar to how pint does with __array_function__ to get desired results

@MichaelTiemannOSC
Copy link
Collaborator Author

xref: pandas-dev/pandas#14162, pandas-dev/pandas#20993, and pandas-dev/pandas#22290. The first two especially give strong encouragement to bring uncertainties into Pandas. I think that many of the 2.x changes make this much easier, though no clue yet on how to deal with the uncertainties package in CI/CD. I think a strong proof of concept in Pint-Pandas could break through into Pandas (which would be a win for all).

@andrewgsavage
Copy link
Collaborator

None of those issues imply the UncertaintyArray should live in pandas! Just that one should be made. Indeed in the first linked issue

better yet would be to write and ExtensionArray to have first class support in pandas.

I'd be very suprised if they'd merge it into pandas, hence my uncertainties or uncertainties-pandas suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pint-Pandas support for uncertain Quantities
2 participants